Swift


Mirantis, 2012 


Goals 


• Understand when Swift is applicable
• Understand the Swift architecture
• Be able to plan a Swift deployment

Swift


"Swift is a highly available, distributed, eventually consistent object/blob store"

Swift capabilities


• Fully distributed
• Multi-tenancy
• File management via REST API
• 3x+ data redundancy
• Leverages commodity hardware
• RAID is not required
• Built-in audit of drives

Swift high-level architecture


• Proxy servers
• Account servers
• Container servers
• Object servers

Swift detailed architecture


[Diagram. Labels: Active Directory Service; LDAPS; Identity Service (Keystone); Locations; Tokens; internal request; Account Ring; Container Ring; lookup node, device, partition by path; Storage Node 1; Audit; Partition Directory Structure; Object Replicator.]


© Mirantis, Inc, 2012. All rights reserved. 


Step 1: Upload file to proxy via API


User uploads a file via the Swift API:
PUT http://swift/acc/cont/file


/dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdb4 /dev/sdb5
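
As a rough illustration of Step 1, the sketch below builds (but does not send) the HTTP request a client might issue. In the Swift API an object is created with PUT. The token value and the `build_upload_request` helper are hypothetical; the URL is the one from the slide.

```python
import urllib.request


def build_upload_request(url, data, token):
    """Build an object upload request for a Swift proxy (not sent here)."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",                     # Swift creates objects with PUT
        headers={"X-Auth-Token": token},  # token obtained from the auth system
    )


# Hypothetical token; URL layout is /<account>/<container>/<object>.
req = build_upload_request("http://swift/acc/cont/file", b"hello", "hypothetical-token")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the upload against a real proxy.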

Swift proxy servers


• Expose the REST API
• Tie together the rest of the Swift architecture
• Handle errors
• Do not cache objects; they proxy them directly
• Use the Ring to route requests

Step 2: Calculate where file should go


md5(http://swift/acc/cont/file)
* z1-ip:port/device/copy1
* z2-ip:port/device/copy2
* z2-ip:port/device/copy3


/dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdb4 /dev/sdb5

The Ring


• Determines where the data should reside in the cluster
• Maintains this mapping using zones, devices, partitions, and replicas
• There is a copy of the Ring on every node of the cluster
• Rings are statically built and assigned
• The Ring is represented as a pickled and gzipped data structure

The Ring data


port     TCP port the server uses to serve requests for the device

device   Disk name of the device in the host system, e.g. sda1. It is used to identify the disk mount point under /srv/node on the host system

meta     General-use field for storing arbitrary information about the device. Not used by servers directly
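
To make the device fields more concrete, here is a minimal sketch of how a ring-like structure could map a partition to the devices holding its replicas. The three devices, their addresses, the `id`, `zone` and `ip` keys, and the table name `replica2part2dev` are illustrative assumptions, not a dump of a real ring.

```python
# Hypothetical three-device cluster; addresses and ports are made up.
devs = [
    {"id": 0, "zone": 1, "ip": "10.0.0.1", "port": 6000, "device": "sdb1", "meta": ""},
    {"id": 1, "zone": 2, "ip": "10.0.0.2", "port": 6000, "device": "sdb1", "meta": ""},
    {"id": 2, "zone": 3, "ip": "10.0.0.3", "port": 6000, "device": "sdb1", "meta": ""},
]

# One row per replica, one column per partition (4 partitions here).
# Each cell is an index into `devs`.
replica2part2dev = [
    [0, 1, 2, 0],
    [1, 2, 0, 1],
    [2, 0, 1, 2],
]


def nodes_for(partition):
    """Return the device entries that hold each replica of a partition."""
    return [devs[row[partition]] for row in replica2part2dev]


def mount_point(dev):
    """The device name identifies the mount point under /srv/node."""
    return "/srv/node/" + dev["device"]


for dev in nodes_for(1):
    print("z%d-%s:%d%s" % (dev["zone"], dev["ip"], dev["port"], mount_point(dev)))
```

Because the table is precomputed, a lookup is just an index operation, which fits with rings being built statically and copied to every node.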

Partitioning


• Partition power: the number of bits taken from the MD5 hash
• 2^(partition power) = number of partitions
• Parts of the cluster communicate in terms of whole partitions, not individual files
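
The relationship between the MD5 hash and the partition power can be sketched as follows. This is a simplified illustration: it hashes only the path, whereas a real cluster may also mix a secret suffix into the hash, and the function name `partition_for` and the path are assumptions.

```python
import hashlib
import struct


def partition_for(path: str, part_power: int) -> int:
    """Map an object path to a partition number in [0, 2**part_power).

    Assumes 1 <= part_power <= 32.
    """
    digest = hashlib.md5(path.encode("utf-8")).digest()
    # Read the first 4 bytes of the digest as an unsigned big-endian integer...
    top32 = struct.unpack_from(">I", digest)[0]
    # ...and keep only its top `part_power` bits.
    return top32 >> (32 - part_power)


part_power = 18
print(2 ** part_power)  # number of partitions: 262144
print(partition_for("/acc/cont/file", part_power))
```

The same path always lands in the same partition, so every node holding a copy of the Ring computes the same location without coordination.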

RingBuilder


• Create rings

swift-ring-builder <builder file> create <part_power> <replicas> <min_part_hours>


• Add devices

swift-ring-builder <builder file> add z<zone>-<ip>:<port>/<device name>_<meta> <weight>


• Verify consistency of ring file

swift-ring-builder <builder file>


• Rebalance rings

swift-ring-builder <builder file> rebalance

Step 3: Upload file to servers


The proxy sends the file to the 3 devices calculated by the Ring.


/dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdb4 /dev/sdb5

Object Server


• Blob storage for objects on local devices
• Metadata is stored in xattrs (requires XFS)
• Supports object versioning
• Supports object expiration

Accounts and containers


• Container: a listing of objects
• Account: a listing of containers
• Information is stored in a SQLite database

Step 4: Replicate


The replicator reads the data from the Ring and compares partitions on all nodes. If an inconsistency is detected, it replicates.


/dev/sdb5

Replication


• Attempts to replicate object state if it detects corruption
• Asks for "more nodes" if the drive of one of the replicas has failed
• Manages deleted records (tombstone cleanup when the consistency window has expired)

DB Replication


1. Compare hashes between now and the previous sync time
2. If the hashes differ, try to inject the missing rows
3. If the DB is not available at all, replicate the complete database

Object Replication


1. Calculate a per-partition hash file
2. Compare it with the per-partition hash files of the other replicas
3. Replicate only the partitions whose hash files differ
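
The three steps above can be sketched in a few lines. This is an illustration of the compare-then-sync idea only: the in-memory dictionaries stand in for the on-disk hash files, and the names `partition_hash` and `partitions_to_sync` are hypothetical.

```python
import hashlib


def partition_hash(suffix_hashes: dict) -> str:
    """Combine the per-directory hashes of one partition into a single digest."""
    h = hashlib.md5()
    for suffix in sorted(suffix_hashes):
        h.update(f"{suffix}:{suffix_hashes[suffix]}".encode("utf-8"))
    return h.hexdigest()


def partitions_to_sync(local: dict, remote: dict) -> list:
    """Return the local partitions whose hash differs from the remote replica.

    Both arguments map partition number -> {directory suffix: content hash}.
    """
    return sorted(
        part for part in local
        if partition_hash(local[part]) != partition_hash(remote.get(part, {}))
    )


local = {1: {"abc": "h1"}, 2: {"def": "h2"}, 3: {"0ff": "h3"}}
remote = {1: {"abc": "h1"}, 2: {"def": "stale"}}
print(partitions_to_sync(local, remote))  # [2, 3]
```

Comparing one digest per partition keeps the exchange between replicas small; only partitions that differ need to be transferred.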

Large object support


1. Define the size of segments for uploading
2. Uploading to container X creates an X_segments container for all segments
3. Segment name format is

<name>/<timestamp>/<size>/<segment>


Example
swift upload test_container -S 1073741824 large_file


Segments
large_file/1290206778.25/21474836480/00000000
large_file/1290206778.25/21474836480/00000001
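
As a worked example of the segment name format, the sketch below generates the names for the upload shown above: a 21474836480-byte file split into 1073741824-byte segments. The helper `segment_names` is hypothetical, and the timestamp is passed in as a string rather than read from the file.

```python
def segment_names(name, timestamp, total_size, segment_size):
    """Build <name>/<timestamp>/<size>/<segment> for every segment of a file."""
    count = -(-total_size // segment_size)  # integer ceiling division
    return [f"{name}/{timestamp}/{total_size}/{i:08d}" for i in range(count)]


names = segment_names("large_file", "1290206778.25", 21474836480, 1073741824)
print(len(names))  # 20
print(names[0])    # large_file/1290206778.25/21474836480/00000000
print(names[1])    # large_file/1290206778.25/21474836480/00000001
```

The zero-padded segment index keeps the names in upload order when they are listed alphabetically.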

Auditors and Updaters


Updaters

• If an update fails, it is queued locally
• The updater picks up the queued update and executes it when possible


Auditors

• Crawl local objects to check their integrity
• Mark broken objects as "quarantined" until the next replication
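
A minimal sketch of the auditor idea follows, assuming integrity is checked by comparing a file's MD5 with a previously recorded value and that a broken file is moved into a quarantine directory. The function name, directory layout, and file names are hypothetical.

```python
import hashlib
import os
import shutil
import tempfile


def audit_object(path, expected_md5, quarantine_dir):
    """Return True if the file is intact; otherwise move it to quarantine_dir."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            md5.update(chunk)
    if md5.hexdigest() == expected_md5:
        return True
    os.makedirs(quarantine_dir, exist_ok=True)
    shutil.move(path, os.path.join(quarantine_dir, os.path.basename(path)))
    return False


with tempfile.TemporaryDirectory() as root:
    obj = os.path.join(root, "obj.data")
    with open(obj, "wb") as f:
        f.write(b"payload")
    quarantine = os.path.join(root, "quarantined")
    ok = audit_object(obj, hashlib.md5(b"payload").hexdigest(), quarantine)
    bad = audit_object(obj, "0" * 32, quarantine)  # deliberately wrong checksum
    moved = os.path.exists(os.path.join(quarantine, "obj.data"))
    print(ok, bad, moved)
```

Once the broken copy is out of the way, the next replication pass can restore a good copy from another replica.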

swift-recon


• Monitors key Swift components, including
  o md5 sums of Rings
  o RLA
  o daemon-specific metrics
• Implemented as middleware and deployed on every node
• Has its own extensible API

Auth


• Can be used with Keystone
• Has its own auth system for stand-alone installation

Swift ACLs


• Read & Write ACLs, set with swift post -r/-w
• Referrer
  o .r:* (all referrers)
  o .r:.allowed.com (only from allowed.com)
  o .r:-.not-allowed.com (not from not-allowed.com)
• Accounts/Users
  o account - for all users in account
  o account:user - only for specified user
• ACLs can be chained (last one wins)
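
The referrer rules and the "last one wins" chaining can be illustrated with a small evaluator. This is a sketch of the matching logic described above, assuming entries are comma-separated; `referrer_allowed` is a hypothetical helper, not Swift's own implementation.

```python
def referrer_allowed(host, acl):
    """Evaluate the .r: entries of an ACL string against a referrer host.

    Entries are checked in order, and the last matching entry decides.
    """
    allowed = False
    for entry in acl.split(","):
        entry = entry.strip()
        if not entry.startswith(".r:"):
            continue  # account and account:user entries are ignored here
        pattern = entry[3:]
        deny = pattern.startswith("-")
        if deny:
            pattern = pattern[1:]
        matches = (
            pattern == "*"
            or host == pattern
            # A leading dot matches the domain itself and its subdomains.
            or (pattern.startswith(".")
                and (host.endswith(pattern) or host == pattern[1:]))
        )
        if matches:
            allowed = not deny
    return allowed


print(referrer_allowed("www.allowed.com", ".r:.allowed.com"))              # True
print(referrer_allowed("x.not-allowed.com", ".r:*,.r:-.not-allowed.com"))  # False
print(referrer_allowed("example.org", ".r:*,.r:-.not-allowed.com"))        # True
```

In the second call the wildcard grants access first, but the later deny entry overrides it, which is the chaining behaviour noted in the last bullet.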

Swift cluster deployment


Proxy Server    Proxy Server

Object Server   Object Server   Object Server
Object Server   Object Server   Object Server

Swift deployment notes


• Proxy servers are more CPU and network I/O intensive
• Account/Container/Object servers are more disk and network I/O intensive
• Plan for relatively small partitions
• Do not use RAID on servers

Swift operations notes


• Long-term drive failure: rebuild the Ring to remove the drive
• Short-term drive failure: don't rebuild the Ring

Swift 1.7.0 and higher


• pickle is replaced with custom serialization in C
• All daemons can publish usage statistics in statsd log format